Extending a Thesaurus with Words from Pan-Chinese Sources

Authors

  • Oi Yee Kwong
  • Benjamin Ka-Yin T'sou
Abstract

In this paper, we extend a Chinese thesaurus with words used distinctly in various Chinese communities. The acquisition and classification of such region-specific lexical items is an important step toward the larger goal of constructing a Pan-Chinese lexical resource. In particular, we extend a previous study in three respects: (1) improving automatic classification by removing duplicated words from the thesaurus, (2) experimenting with classifying words at the subclass level and the semantic head level, and (3) further investigating the possible effects of data heterogeneity between the region-specific words and the words in the thesaurus on classification performance. Automatic classification was based on the similarity between a target word and individual categories of words in the thesaurus, measured by the cosine function. Experiments were conducted on 120 target words from four regions. The automatic classification results were evaluated against a gold standard obtained from human judgements. In general, accuracy reached 80% or more when the top 10 (out of 80+) candidates were considered at the subclass level and the top 100 (out of 1,300+) at the semantic head level, provided that the appropriate data sources were used. © 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-ncsa/3.0/). Some rights reserved.
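The classification and evaluation scheme described in the abstract (cosine similarity between a target word and each thesaurus category, scored by whether the gold category appears among the top N candidates) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the sparse count-vector representation, the function names `cosine`, `rank_categories` and `top_n_accuracy`, and the toy categories and words. None of it is taken from the paper's actual features or data.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_categories(target_vec, category_vecs, top_n):
    """Return the names of the top_n categories most similar to the target word."""
    scored = sorted(
        category_vecs.items(),
        key=lambda item: cosine(target_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_n]]


def top_n_accuracy(targets, gold, category_vecs, top_n):
    """Fraction of target words whose gold category is among the top_n candidates."""
    hits = sum(
        1
        for word, vec in targets.items()
        if gold[word] in rank_categories(vec, category_vecs, top_n)
    )
    return hits / len(targets)


# Hypothetical toy data: each category and target word is a bag of context words.
category_vecs = {
    "finance": Counter({"bank": 3, "stock": 2}),
    "food": Counter({"rice": 4, "eat": 1}),
    "transport": Counter({"bus": 2, "road": 2}),
}
targets = {
    "w1": Counter({"stock": 1, "bank": 1}),
    "w2": Counter({"eat": 2, "bank": 1}),
}
gold = {"w1": "finance", "w2": "food"}

print(top_n_accuracy(targets, gold, category_vecs, top_n=1))
print(top_n_accuracy(targets, gold, category_vecs, top_n=2))
```

In this toy setup the second word shares context with two categories, so its gold category is ranked second and is only counted as correct once two candidates are considered. This is the same reason the abstract reports accuracy over the top 10 or top 100 candidates instead of the single best match.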

Similar resources

Extending a Thesaurus in the Pan-Chinese Context

In this paper, we address a unique problem in Chinese language processing and report on our study on extending a Chinese thesaurus with region-specific words, mostly from the financial domain, from various Chinese speech communities. With the larger goal of automatically constructing a Pan-Chinese lexical resource, this work aims at taking an existing semantic classificatory structure as levera...

Toward a Pan-Chinese Thesaurus

In this paper, we propose a corpus-based approach to the construction of a Pan-Chinese lexical resource, starting out with the aim to enrich existing Chinese thesauri in the Pan-Chinese context. The resulting thesaurus is thus expected to contain not only the core senses and usages of Chinese lexical items but also usages specific to individual Chinese speech communities. We introduce the ratio...

An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Text

Put forward a new method about automatic Chinese text segmentation based on Chinese characters string (CCS) frequency and length descending. It can automatically segment meaningful CCS in text based on processing longer string first and string frequency information, with no thesaurus, no acquiring the probability between words in advance and no Chinese character index. This method can effective...

Semantic Classification of Chinese Unknown Words

This paper describes a classifier that assigns semantic thesaurus categories to unknown Chinese words (words not already in the CiLin thesaurus and the Chinese Electronic Dictionary, but in the Sinica Corpus). The focus of the paper differs in two ways from previous research in this particular area. Prior research in Chinese unknown words mostly focused on proper nouns (Lee 1993, Lee, Lee and C...

Combining Multiple Lexical Resources for Chinese Textual Entailment Recognition

Identifying Textual Entailment is the task of finding the relationship between the given hypothesis and text fragments. Developing a high-performance text paraphrasing system usually requires rich external knowledge such as syntactic parsing, thesaurus which is limited in Chinese since the Chinese word segmentation problem should be resolve first. By following last year, in this year, we contin...

Publication date: 2008